Abstract


Decoding human brain activities via functional 
magnetic resonance imaging (fMRI) has gained in- 
creasing attention in recent years. While encourag- 
ing results have been reported in brain states clas- 
sification tasks, reconstructing the details of human 
visual experience still remains difficult. Two main 
challenges that hinder the development of effective
models are the perplexing fMRI measurement
noise and the high dimensionality of limited data 
instances. Existing methods generally suffer from
one or both of these issues and yield unsatisfactory
results. In this paper, we tackle this problem by
casting the reconstruction of visual stimulus as the 
Bayesian inference of missing view in a multiview 
latent variable model. Sharing a common latent 
representation, our joint generative model of exter- 
nal stimulus and brain response is not only “deep” 
in extracting nonlinear features from visual images, 
but also powerful in capturing correlations among 
voxel activities of fMRI recordings. The nonlinear- 
ity and deep structure endow our model with strong 
representation ability, while the correlations of vox- 
el activities are critical for suppressing noise and 
improving prediction. We devise an efficient varia- 
tional Bayesian method to infer the latent variables 
and the model parameters. To further improve the 
reconstruction accuracy, the latent representations
of test instances are constrained to be close to those
of their neighbours from the training set via
posterior regularization. Experiments on three fMRI
recording datasets demonstrate that our approach 
can more accurately reconstruct visual stimuli. 


1 Introduction 


Brain decoding, which aims to predict information about
external stimuli from brain activity, plays an important
role in brain-machine interfaces (BMIs). Recent developments
in this area have shown promising results [Schoenmakers
et al., 2015; Lee and Kuhl, 2016]. However, most previous
studies focus only on predicting the category of the
presented stimulus [Van Gerven et al., 2010a; Ng and
Abugharbieh, 2011; Damarla and Just, 2013; Elahe’ Yargholi,
2016]. Accurate reconstruction of visual stimuli from brain
activity has not yet been adequately examined and still
requires considerable effort. Two main challenges that
hinder the development of effective models are the
perplexing measurement noise of functional magnetic
resonance imaging (fMRI) and the high dimensionality of
limited data instances. Existing methods generally suffer
from one or both of these issues and yield unsatisfactory
results.


Fujiwara et al. proposed using Bayesian canonical
correlation analysis (BCCA) for building the reconstruction 
model, where image bases are automatically extracted from 
the measured data [Fujiwara et al., 2013]. As a latent variable
model interpretation of non-probabilistic CCA, BCCA assumes
a linear observation model for visual images and a spherical
covariance for the Gaussian distribution of voxel activities.
In practice, however, a linear observation model for visual
images has limited representation power, and a spherical
covariance cannot capture the correlations among voxel
activities. Since measurement noise is ubiquitous in voxel
activities, exploiting the correlations among voxel activities
is critical for suppressing noise and improving prediction
performance.


On the other hand, introducing deep structure into multiview
representation learning has attracted increasing attention
recently [Wang et al., 2015; Chandar et al., 2016].
Deep canonically correlated autoencoders (DCCAE), which 
consists of two deep autoencoders and optimizes the com- 
bination of canonical correlation between the learned bottle- 
neck representations and the reconstruction errors of the au- 
toencoders, can extract nonlinear features from both views 
and reconstruct each view by the correlational bottleneck 
representations [Wang et al., 2015]. Nevertheless, DCCAE 
does not consider cross-reconstruction between the two views,
which limits its effectiveness in applications where a missing 
view needs to be reconstructed from the existing one. To our 
knowledge, no deep multiview learning model with shared 
generative latent representation has been designed specifical- 
ly for missing view reconstruction. 

Focusing on these problems, we present a deep generative 
multiview model (DGMM), where we cast the reconstruction 
of perceived image as the Bayesian inference of the miss- 
ing view. Sharing a common latent representation, DGMM 
allows us to generate visual images and fMRI activity pat- 
terns simultaneously. For visual images, unlike BCCA, we 


explore nonlinear observation models parameterized by deep 
neural networks (DNNs), which can be multi-layered percep- 
trons (MLPs) or convolutional neural networks (CNNs). This 
nonlinearity and deep structure endow our model with strong 
representation ability. For fMRI activity patterns, we adopt a 
full covariance matrix for the Gaussian distribution of voxel 
activities. While the full covariance matrix has the advantage 
of capturing the correlations among voxels, it results in se- 
vere computational issues. To reduce the complexity, we im- 
pose a low-rank assumption on the covariance matrix. This 
is beneficial to suppressing noise and improving prediction 
performance. Furthermore, we devise an efficient mean-field 
variational inference method to infer the latent variables and 
the model parameters. To further improve the reconstruction 
accuracy, the latent representations of test instances are
constrained to be close to those of their neighbours from the
training set via posterior regularization [Zhu et al., 2014].
Compared with the non-probabilistic deep multiview
representation learning models mentioned above [Wang et al., 2015;
Chandar et al., 2016], our Bayesian model has the inherent
advantage of avoiding overfitting to a small training set by
model averaging. Finally, extensive experimental compar- 
isons on three fMRI recording datasets demonstrate that our 
approach can reconstruct visual images from fMRI measure- 
ments more accurately. 


2 Related work 


In the literature of brain decoding, there are a relative- 
ly limited number of studies reporting perceived image re- 
constructions to date. Miyawaki et al. reconstructed the 
lower-order information such as binary contrast patterns us- 
ing a combination of multi-scale local image bases whose 
shapes are predefined [Miyawaki et al., 2008]. Van Gerven
et al. reconstructed handwritten digits using deep belief
networks [Van Gerven et al., 2010b]. Schoenmakers et al.
reconstructed handwritten characters using a straightforward
linear Gaussian approach [Schoenmakers et al., 2013].
Fujiwara et al. proposed a reconstruction model in which
image bases are automatically estimated by Bayesian
canonical correlation analysis (BCCA) [Fujiwara et al., 2013].
In addition, some works attempt to reconstruct movie clips
[Nishimoto et al., 2011; Haiguang Wen and Liu, 2016].

Though a similar strategy to our work has been used by 
Fujiwara et al. [Fujiwara et al., 2013] for visual image re- 
construction, its linear observation model for visual images 
has limited representation power in practice. Several recently
proposed deep multiview representation learning models can
be applied to visual image reconstruction [Wang et al.,
2015; Chandar et al., 2016]. For example, deep canonically
correlated autoencoders (DCCAE), with nonlinear observation
models for both views, can learn deep correlational
representations and reconstruct each view from the learned
representations [Wang et al., 2015]. Compared with DCCAE,
correlational neural networks (CorrNet) further consider the
cross-reconstruction between the two views [Chandar et al.,
2016]. However, directly applying the nonlinear maps of
DCCAE and CorrNet to limited, noisy brain activities is
prone to overfitting.

Inspired by recent developments in deep generative mod- 
els such as variational autoencoders (VAE) [Kingma and 
Welling, 2014], we present a deep generative multiview mod- 
el (DGMM), which can be viewed as a nonlinear extension 
of the linear method BCCA. To the best of our knowledge, 
this paper is the first to study visual image reconstruction via 
Bayesian deep learning. 


3 Perceived image reconstruction with 
DGMM 


In this section, we cast the reconstruction of perceived im- 
ages from human brain activity as the Bayesian inference of 
missing view in a multiview latent variable model. 

Assume the training set consists of paired observations
from two distinct views (X, Y), denoted by $(x_1, y_1), \ldots,
(x_N, y_N)$, where $N$ is the training set size, $x_i \in \mathbb{R}^{D_1}$
and $y_i \in \mathbb{R}^{D_2}$ for $i = 1, \ldots, N$. Here
$X \in \mathbb{R}^{D_1 \times N}$ and $Y \in \mathbb{R}^{D_2 \times N}$
denote the visual images and fMRI activity patterns,
respectively. The presence of paired two-view data presents
an opportunity to learn better representations by analyzing
both views simultaneously. Therefore, we introduce the shared
latent variables $Z \in \mathbb{R}^{K \times N}$ to relate the visual
images X to the fMRI activity patterns Y. The shared latent
variables are assigned the following Gaussian prior
distribution,


$$p(Z) = \prod_{i=1}^{N} \mathcal{N}(z_i \mid 0, I). \tag{1}$$


Since the visual image and associated fMRI activity pattern 
are assumed to be generated from the same latent variables, 
we have two likelihood functions. One is for visual images, 
and the other is for fMRI activity patterns. 


3.1 Deep generative model for perceived images 


When observation noises for image pixels are assumed to 
follow a Gaussian distribution with zero mean and diagonal 
covariance, the likelihood function of visual images is 


$$p_\theta(X \mid Z) = \prod_{i=1}^{N} \mathcal{N}\big(x_i \mid \mu_x(z_i), \mathrm{diag}(\sigma_x^2(z_i))\big), \tag{2}$$


where the mean $\mu_x(z_i)$ and covariance $\sigma_x^2(z_i)$ are
nonlinear functions of the latent variables $z_i$. To allow the
second moment of the data to be captured by the density model,
we choose these nonlinear functions to be deep neural networks
(DNNs), referred to as the generative network $g_\theta(z)$,
parameterized by $\theta$. Here the DNNs can be multi-layered
perceptrons (MLPs) or convolutional neural networks (CNNs).
Compared with a linear observation model, DNNs
can extract nonlinear features from visual images and capture 
the stages of human visual processing from early visual areas 
towards the ventral streams [Güçlü and van Gerven, 2015; 
Cichy et al., 2016]. This nonlinearity and deep structure en- 
dow our model with strong representation ability. 
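
The likelihood in Eq.(2) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the one-hidden-layer network, the `tanh` activation, the log-variance output and all function names (`init_decoder`, `decode`, `diag_gaussian_logpdf`) are assumptions made only for illustration.

```python
import numpy as np

def init_decoder(rng, k, hidden, d1):
    """Random weights for a one-hidden-layer generative network (hypothetical sizes)."""
    return {
        "W1": rng.normal(0, 0.1, (hidden, k)), "b1": np.zeros(hidden),
        "W_mu": rng.normal(0, 0.1, (d1, hidden)), "b_mu": np.zeros(d1),
        "W_lv": rng.normal(0, 0.1, (d1, hidden)), "b_lv": np.zeros(d1),
    }

def decode(params, z):
    """Map a latent vector z of shape (K,) to the mean and log-variance of p(x | z)."""
    h = np.tanh(params["W1"] @ z + params["b1"])
    mu_x = params["W_mu"] @ h + params["b_mu"]
    log_var_x = params["W_lv"] @ h + params["b_lv"]  # log sigma_x^2, assumed parameterization
    return mu_x, log_var_x

def diag_gaussian_logpdf(x, mu, log_var):
    """log N(x | mu, diag(exp(log_var))), summed over pixels."""
    return -0.5 * np.sum(np.log(2 * np.pi) + log_var + (x - mu) ** 2 / np.exp(log_var))
```

Predicting the log-variance rather than the variance keeps the covariance positive without an explicit constraint; this is a common design choice, not something stated in the paper.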


3.2 Generative model for fMRI activity patterns 


fMRI voxels are generally highly correlated, and the cor- 
relation can carry relevant information about stimuli or tasks, 
Figure 1: Graphical models for DGMM. (a) Directly using the
full covariance matrix $\Psi$. (b) Imposing a low-rank
assumption on $\Psi$ ($\Psi = H^\top H + \gamma^{-1} I$).


even in the absence of information in individual voxels [Ya- 
mashita et al., 2008; Hossein-Zadeh and others, 2016]. How- 
ever, most existing methods [Fujiwara et al., 2013; Schoen- 
makers et al., 2013] simply assume a spherical or diagonal 
covariance for the Gaussian distribution of voxel activities 
thus ignoring any correlations among voxels. Unlike them, 
we assume the observation noises of voxel activities follow a 
Gaussian distribution with zero mean and full covariance ma- 
trix. While this difference might seem minor, it is critical for 
the model to be able to suppress noise and improve prediction 
performance. In addition, although nonlinear transformations 
for fMRI activity patterns are more powerful than linear trans- 
formations (in terms of the types of features they can learn), 
extant multi-voxel pattern analysis (MVPA) studies have not 
found a clear performance benefit for nonlinear versus linear 
transformations. Therefore, we assume the likelihood func- 
tion of fMRI activity patterns is 


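$$p(Y \mid Z) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid B^\top z_i, \Psi). \tag{3}$$

The model should be further complemented with priors
for the projection matrix $B$ and the covariance matrix
$\Psi \in \mathbb{R}^{D_2 \times D_2}$. Popular choices are the automatic
relevance determination (ARD) prior and the Wishart
distribution for $B$ and $\Psi^{-1}$, respectively,


$$
\begin{aligned}
p(\tau) &= \prod_{j=1}^{D_2} \mathcal{G}(\tau_j \mid \alpha_\tau, \beta_\tau) \\
p(B \mid \tau) &= \prod_{j=1}^{D_2} \mathcal{N}(b_j \mid 0, \tau_j^{-1} I) \\
p(\Psi^{-1}) &= \mathcal{W}(\Psi^{-1} \mid V, n_0)
\end{aligned}
\tag{4}
$$


where $\mathcal{G}(\cdot \mid \alpha, \beta)$ denotes the gamma distribution with
shape parameter $\alpha$ and rate parameter $\beta$, and $V$ and $n_0$
denote the scale matrix and degrees of freedom of the Wishart
distribution, respectively.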

While the above model has the advantage of capturing the 
correlations among voxels, it results in severe computational 


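issues (the cost is cubic as a function of $D_2$). Fortunately, the
problem of inferring the high-dimensional covariance matrix $\Psi$
can be solved by introducing auxiliary latent variables
$\tilde{Z} \in \mathbb{R}^{\tilde{K} \times N}$ [Archambeau and Bach, 2009],

$$p(\tilde{Z}) = \prod_{i=1}^{N} \mathcal{N}(\tilde{z}_i \mid 0, I), \tag{5}$$

and rewriting the likelihood function in Eq.(3) as

$$p(Y \mid Z, \tilde{Z}) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid B^\top z_i + H^\top \tilde{z}_i, \gamma^{-1} I), \tag{6}$$

where an ARD prior and a simple gamma prior can be set for the
extra projection matrix $H \in \mathbb{R}^{\tilde{K} \times D_2}$ and the noise
precision parameter $\gamma$, respectively,


$$
\begin{aligned}
p(\eta) &= \prod_{j=1}^{D_2} \mathcal{G}(\eta_j \mid \alpha_\eta, \beta_\eta) \\
p(H \mid \eta) &= \prod_{j=1}^{D_2} \mathcal{N}(h_j \mid 0, \eta_j^{-1} I) \\
p(\gamma) &= \mathcal{G}(\gamma \mid \alpha_\gamma, \beta_\gamma).
\end{aligned}
\tag{7}
$$

The graphical models of DGMM are shown in Fig.1. Note
that the sparsity of the projection matrices $B$ and $H$ can be
tuned by assigning suitable values to the hyper-parameters
$(\alpha_\tau, \beta_\tau)$ and $(\alpha_\eta, \beta_\eta)$, respectively.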

By integrating out the auxiliary latent variables $\tilde{Z}$, Eq.(6)
can be shown to be equivalent to imposing a low-rank assumption
on the covariance matrix $\Psi$ in Eq.(3)
($\Psi = H^\top H + \gamma^{-1} I$), which decreases the computational
complexity. From another perspective, this low-rank assumption
produces a full factorization of the variation in fMRI data into
shared components $Z$ and private components $\tilde{Z}$. The ability
to identify what is shared and what is non-shared makes our
model good at suppressing noise and improving prediction
performance.
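
The computational benefit of the low-rank form can be checked numerically. The sketch below is an illustration only (function names are hypothetical): it builds the low-rank-plus-spherical covariance and inverts it with the Woodbury identity, so that only a system of the size of the auxiliary latent dimension is solved instead of a $D_2 \times D_2$ one.

```python
import numpy as np

def lowrank_cov(H, gamma):
    """Psi = H^T H + gamma^{-1} I for H of shape (K_aux, D2)."""
    d2 = H.shape[1]
    return H.T @ H + np.eye(d2) / gamma

def lowrank_precision(H, gamma):
    """Psi^{-1} via the Woodbury identity; only a (K_aux x K_aux) system is solved."""
    k_aux, d2 = H.shape
    inner = np.eye(k_aux) + gamma * (H @ H.T)
    return gamma * np.eye(d2) - gamma**2 * H.T @ np.linalg.solve(inner, H)
```

With a few auxiliary dimensions and thousands of voxels, the dominant cost is the matrix products rather than a cubic-in-$D_2$ inversion.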

As short-hand notation, all hyper-parameters in the model
are denoted by $\Omega = \{\alpha_\tau, \beta_\tau, \alpha_\eta, \beta_\eta, \alpha_\gamma, \beta_\gamma\}$,
the priors by $\Xi = \{\tau, \eta, \gamma\}$ and the remaining variables by
$\Theta = \{B, H, Z, \tilde{Z}\}$. Dependence on $\Omega$ is omitted for clarity
throughout the paper. Then we can get the following posterior
distribution using Bayes’ rule


$$p(\Theta, \Xi \mid X, Y) = \frac{p_\theta(X \mid Z)\, p(Y \mid Z, \tilde{Z})\, p(\Theta \mid \Xi)\, p(\Xi)}{p_\theta(X, Y)}, \tag{8}$$


where $p_\theta(X, Y)$ is the normalization constant.


4 Variational posterior inference 


Given the above generative model, exact inference is
intractable. Here we formulate a mean-field variational
approximate inference method to infer the latent variables and
model parameters. Specifically, we assume a family of
factorizable and free-form (except for $q(Z)$) variational
distributions


$$q(\Theta, \Xi) = q(B)\, q(H)\, q(Z)\, q(\tilde{Z})\, q(\tau)\, q(\eta)\, q(\gamma),$$

and define $q(Z)$ as a product of multivariate Gaussian
distributions with diagonal covariance¹, i.e.,


$$q(Z) = \prod_{i=1}^{N} q_\phi(z_i \mid x_i) = \prod_{i=1}^{N} \mathcal{N}\big(z_i \mid \mu_z(x_i), \mathrm{diag}(\sigma_z^2(x_i))\big),$$


where the mean $\mu_z(x_i) = [\mu_{z_{i1}}, \ldots, \mu_{z_{iK}}]^\top$ and covariance
$\mathrm{diag}(\sigma_z^2(x_i)) = \mathrm{diag}(\sigma^2_{z_{i1}}, \ldots, \sigma^2_{z_{iK}})$ are outputs of
the recognition network specified by another DNN with
parameters $\phi$. Then the objective is to find the optimal
distribution, which minimizes the Kullback-Leibler (KL)
divergence between the approximating distribution and the
target posterior, i.e.,


$$\min_{q(\Theta, \Xi) \in \mathcal{P}} \mathrm{KL}\big(q(\Theta, \Xi) \,\|\, p_\theta(\Theta, \Xi \mid X, Y)\big),$$


where $\mathcal{P}$ is the space of probability distributions.
Equivalently, we can also bound the marginal likelihood:


$$
\begin{aligned}
\log p_\theta(X, Y)
&= \mathbb{E}_{q(\Theta,\Xi)}\big[\log p_\theta(X, Y, \Theta, \Xi) - \log q(\Theta, \Xi)\big]
 + \mathrm{KL}\big(q(\Theta, \Xi) \,\|\, p_\theta(\Theta, \Xi \mid X, Y)\big) \\
&\ge \mathbb{E}_{q(\Theta,\Xi)}\big[\log p_\theta(X, Y, \Theta, \Xi) - \log q(\Theta, \Xi)\big] \\
&= \int q(\Theta, \Xi) \Big[\log \frac{p(\Theta, \Xi)}{q(\Theta, \Xi)} + \log p_\theta(X \mid \Theta, \Xi)
 + \log p(Y \mid \Theta, \Xi)\Big]\, d\Theta\, d\Xi \\
&= \mathcal{L}_p(\Theta, \Xi, X, Y) + \mathcal{L}_x(\Theta, \Xi, X, Y) + \mathcal{L}_y(\Theta, \Xi, X, Y) \\
&= \mathcal{L}(\Theta, \Xi, X, Y)
\end{aligned}
\tag{9}
$$


where we used the fact that the KL divergence is guaranteed
to be non-negative, and


$$
\begin{aligned}
\mathcal{L}_p(\Theta, \Xi, X, Y) &= -D_{KL}\big(q(\Theta, \Xi) \,\|\, p(\Theta, \Xi)\big) \\
\mathcal{L}_x(\Theta, \Xi, X, Y) &= \mathbb{E}_{q(\Theta,\Xi)}\big[\log p_\theta(X \mid \Theta, \Xi)\big] \\
\mathcal{L}_y(\Theta, \Xi, X, Y) &= \mathbb{E}_{q(\Theta,\Xi)}\big[\log p(Y \mid \Theta, \Xi)\big].
\end{aligned}
$$

Intuitively, $\mathcal{L}_x$ and $\mathcal{L}_y$ can be interpreted as the (negative)
expected reconstruction errors of visual images and fMRI
activity patterns, respectively. Maximizing this lower bound
strikes a balance between minimizing the reconstruction errors
of the two views and minimizing the KL divergence between the
approximate posterior and the prior.

¹ We also considered conditioning the posterior distribution $q(Z)$
on both X and Y, but we did not observe an obvious performance
improvement.


4.1 Learning $\theta$, $\phi$ and $Z$


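Given the fixed-form approximate posterior distribution for
factor $Z$, $\mathcal{L}_p(Z, X, Y)$ can be computed exactly as:


$$
\begin{aligned}
\mathcal{L}_p(Z, X, Y) &= \sum_{i=1}^{N} -D_{KL}\big(q_\phi(z_i \mid x_i) \,\|\, p(z_i)\big) \\
&= \frac{1}{2} \sum_{i=1}^{N} \sum_{k=1}^{K} \big(1 + \log(\sigma^2_{z_{ik}}) - \mu^2_{z_{ik}} - \sigma^2_{z_{ik}}\big).
\end{aligned}
$$

On the other hand, $\mathcal{L}_x(Z, X, Y)$ and $\mathcal{L}_y(Z, X, Y)$ can
be approximated by Monte-Carlo sampling [Kingma and
Welling, 2014; Kingma et al., 2014]. Instead of sampling
directly from $q_\phi(z_i \mid x_i)$, $z_i$ is computed as a deterministic
function of $x_i$ and some noise term such that $z_i$ has the
desired distribution. Assuming we draw $L$ samples, $z_i^{(l)}$
($l = 1, \ldots, L$) can be expressed as

$$z_i^{(l)} = \mu_z(x_i) + \sigma_z(x_i) \odot \epsilon^{(l)},$$

where $\epsilon^{(l)} \sim \mathcal{N}(0, I)$ and $\odot$ denotes element-wise
multiplication. Then the resulting Monte-Carlo approximations
are

$$
\begin{aligned}
\tilde{\mathcal{L}}_x(Z, X, Y) &= \sum_{i=1}^{N} \mathbb{E}_{q(z_i)}\big[\log p_\theta(x_i \mid z_i)\big]
 \approx \frac{1}{L} \sum_{i=1}^{N} \sum_{l=1}^{L} \log p_\theta(x_i \mid z_i^{(l)}), \\
\tilde{\mathcal{L}}_y(Z, X, Y) &= \sum_{i=1}^{N} \mathbb{E}_{q(z_i)}\big[\log p(y_i \mid z_i)\big]
 \approx \frac{1}{L} \sum_{i=1}^{N} \sum_{l=1}^{L} \log p(y_i \mid z_i^{(l)}).
\end{aligned}
$$

Finally, the parameters of the DNNs ($\theta$ and $\phi$) can be obtained
by optimizing the objective function $\mathcal{L}(Z, X, Y)$ (based on
minibatches) using standard stochastic gradient-based
optimization methods such as SGD, RMSprop or AdaGrad
[Duchi et al., 2011].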
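
The two ingredients of this subsection, the closed-form KL term and the reparameterized sampling step, can be sketched as follows. This is an illustrative NumPy fragment under the assumption that the recognition network outputs a mean and a log-variance per latent dimension; the function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)); the negative of one summand of L_p."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def reparameterize(mu, log_var, rng, n_samples):
    """Draw z^(l) = mu + sigma * eps^(l) with eps^(l) ~ N(0, I), for l = 1..n_samples."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    return mu + np.exp(0.5 * log_var) * eps
```

Because the noise is drawn independently of `mu` and `log_var`, gradients of a Monte-Carlo objective can flow through both outputs of the recognition network when the same computation is written in an automatic-differentiation framework.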


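4.2 Learning $B$, $H$, $\tilde{Z}$ and $\Xi$

For a specific factor $\pi$ (except for $Z$), it can be shown that
when keeping all other factors fixed the optimal distribution
$q^*(\pi)$ satisfies

$$q^*(\pi) \propto \exp\big\{\mathbb{E}_{q(\{\Theta, \Xi\} \setminus \pi)}\big[\log p_\theta(X, Y, \Theta, \Xi)\big]\big\}.$$


For our model, thanks to conjugacy, the resulting optimal
distribution of each factor belongs to the same family as the
corresponding prior.

The optimal distributions of the projection parameters can
be found as a product of multivariate Gaussian distributions:


$$
\begin{aligned}
q^*(B) &= \prod_{j=1}^{D_2} \mathcal{N}\big(b_j \mid \mu_{b_j}, [\langle\tau_j\rangle I + \langle\gamma\rangle \langle Z Z^\top\rangle]^{-1}\big) \\
q^*(H) &= \prod_{j=1}^{D_2} \mathcal{N}\big(h_j \mid \mu_{h_j}, [\langle\eta_j\rangle I + \langle\gamma\rangle \langle \tilde{Z} \tilde{Z}^\top\rangle]^{-1}\big)
\end{aligned}
\tag{10}
$$


where the notation $\langle\cdot\rangle$ denotes the expectation operator, i.e.,
$\langle\gamma\rangle$ means the expectation of $\gamma$ over its current optimal
distribution, and


$$
\begin{aligned}
\mu_{b_j} &= \Sigma_{b_j} \sum_{i=1}^{N} \langle\gamma\rangle \big[y_{ij} - \langle h_j^\top\rangle \langle\tilde{z}_i\rangle\big] \langle z_i\rangle \\
\mu_{h_j} &= \Sigma_{h_j} \sum_{i=1}^{N} \langle\gamma\rangle \big[y_{ij} - \langle b_j^\top\rangle \langle z_i\rangle\big] \langle\tilde{z}_i\rangle,
\end{aligned}
$$

with $\Sigma_{b_j}$ and $\Sigma_{h_j}$ the corresponding covariance matrices in
Eq.(10).

The optimal distribution of the auxiliary latent variables
can also be found as a product of multivariate Gaussian
distributions:


$$q^*(\tilde{Z}) = \prod_{i=1}^{N} \mathcal{N}\big(\tilde{z}_i \mid \mu_{\tilde{z}_i}, [I + \langle\gamma\rangle \langle H H^\top\rangle]^{-1}\big) \tag{11}$$

$$\mu_{\tilde{z}_i} = \Sigma_{\tilde{z}_i} \sum_{j=1}^{D_2} \langle\gamma\rangle \big(y_{ij} - \langle b_j^\top\rangle \langle z_i\rangle\big) \langle h_j\rangle.$$


The optimal distributions of the precision variables can be
formulated as:


$$
\begin{aligned}
q^*(\tau) &= \prod_{j=1}^{D_2} \mathcal{G}\big(\tau_j \mid \alpha_\tau + \tfrac{K}{2},\; \beta_\tau + \tfrac{1}{2}\langle b_j^\top b_j\rangle\big) \\
q^*(\eta) &= \prod_{j=1}^{D_2} \mathcal{G}\big(\eta_j \mid \alpha_\eta + \tfrac{\tilde{K}}{2},\; \beta_\eta + \tfrac{1}{2}\langle h_j^\top h_j\rangle\big) \\
q^*(\gamma) &= \mathcal{G}\big(\gamma \mid \alpha_\gamma + \tfrac{N D_2}{2},\; \beta_\gamma + \tfrac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{D_2} \tilde{y}_{ij}^2\big)
\end{aligned}
\tag{12}
$$

where $\tilde{y}_{ij} = y_{ij} - \langle b_j^\top\rangle \langle z_i\rangle - \langle h_j^\top\rangle \langle\tilde{z}_i\rangle$.


4.3 Convergence


The inference mechanism sequentially updates the optimal
distributions of the latent variables and the model parameters
until convergence, which is guaranteed because the KL
divergence is convex with respect to each of the factors.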
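
One sweep of the coordinate updates can be sketched for two of the factors, $q^*(B)$ and $q^*(\tau)$. This is a hedged NumPy illustration rather than the authors' code: the argument layout, the names `update_q_B` and `update_q_tau`, and the explicit per-voxel loop are assumptions chosen for readability.

```python
import numpy as np

def update_q_B(Y, Ez, Ezz, Ez_aux, EH, E_tau, E_gamma):
    """
    One update of q*(B) = prod_j N(b_j | mu_j, Sigma_j).
    Y: (D2, N) voxel data; Ez: (K, N) means <z_i>; Ezz: (K, K) = <Z Z^T>;
    Ez_aux: (K_aux, N) means of the auxiliary latents; EH: (K_aux, D2) mean of H;
    E_tau: (D2,) means <tau_j>; E_gamma: scalar mean <gamma>.
    """
    K = Ez.shape[0]
    residual = Y - EH.T @ Ez_aux  # y_ij - <h_j>^T <z~_i>, shape (D2, N)
    mu = np.empty((Y.shape[0], K))
    cov = np.empty((Y.shape[0], K, K))
    for j in range(Y.shape[0]):
        cov[j] = np.linalg.inv(E_tau[j] * np.eye(K) + E_gamma * Ezz)
        mu[j] = cov[j] @ (E_gamma * Ez @ residual[j])
    return mu, cov

def update_q_tau(mu, cov, alpha_tau, beta_tau):
    """Shape and rate of each gamma factor q*(tau_j), given the moments of q*(B)."""
    K = mu.shape[1]
    # <b_j^T b_j> = mu_j^T mu_j + tr(Sigma_j)
    second_moment = np.sum(mu**2, axis=1) + np.trace(cov, axis1=1, axis2=2)
    shape = np.full(mu.shape[0], alpha_tau + 0.5 * K)
    rate = beta_tau + 0.5 * second_moment
    return shape, rate
```

The mean of each gamma factor, needed in the next sweep, is `shape / rate`. The updates for $H$ and $\eta$ have the same structure with the roles of the shared and auxiliary latents exchanged.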


4.4 Prediction 


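Using the estimated parameters, we can derive the predictive
distribution for a visual image $x_{pred}$ given a new brain
activity pattern $y_*$. The predictive distribution
$p(x_{pred} \mid y_*)$ can be formulated as follows,


$$p(x_{pred} \mid y_*) = \int p_\theta(x_{pred} \mid z_*)\, p(z_* \mid y_*)\, dz_*, \tag{13}$$


where the posterior distribution of latent variables $p(z_* \mid y_*)$
can be derived by


$$
p(z_* \mid y_*) \propto \int p(y_* \mid z_*, \tilde{z}_*, B, H, \gamma)\, p(z_*)\, p(\tilde{z}_*)\,
q^*(B)\, q^*(H)\, q^*(\gamma)\, d\tilde{z}_*\, dB\, dH\, d\gamma. \tag{14}
$$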


The posterior distribution $p(z_* \mid y_*) = p(z_*)\, p(y_* \mid z_*) / p(y_*)$
can be equivalently obtained by solving the following
information-theoretic optimization problem:


$$\min_{q(z_*) \in \mathcal{P}} \mathrm{KL}\big(q(z_*) \,\|\, p(z_* \mid y_*)\big). \tag{15}$$


Expanding Eq.(15) and ignoring the term unrelated to $q(z_*)$,
we further get


$$\min_{q(z_*) \in \mathcal{P}} \mathrm{KL}\big(q(z_*) \,\|\, p(z_*)\big) - \mathbb{E}_{q(z_*)}\big[\log p(y_* \mid z_*)\big].$$

To ensure that the latent representations of test instances are
close to those of their neighbours from the training set, we
adopt the posterior regularization [Zhu et al., 2014] strategy
to incorporate manifold regularization into the above
posterior predictive distribution $p(z_* \mid y_*)$. Specifically, we
define the following expected manifold regularization:


$$R(q(z_*)) = \mathbb{E}_{q(z_*)}\Big[\frac{1}{2} \sum_{i=1}^{N} s_i \|z_* - z_i\|^2\Big],$$


where $s_i$ is some similarity measure of instances $y_i$ and $y_*$.
Here we use a k-nearest neighbor graph to effectively model
the local geometric structure in the input space, and the
affinity graph is defined as:


$$
s_i =
\begin{cases}
\exp\Big(-\dfrac{\|y_* - y_i\|^2}{t}\Big), & y_i \in \mathcal{N}_k(y_*), \\
0, & \text{otherwise},
\end{cases}
$$


where $\mathcal{N}_k(y_*)$ denotes the k-nearest neighbors of $y_*$ and $t$ is the
width of the heat kernel.
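
The affinity weights can be computed with a few lines of NumPy. The sketch below is an illustration under stated assumptions: a heat-kernel similarity with a width parameter `t` (the exact scaling used in the paper is an assumption here), Euclidean distances in voxel space, and a hypothetical function name.

```python
import numpy as np

def knn_heat_kernel_weights(y_star, Y_train, k, t):
    """
    y_star: (D2,) test pattern; Y_train: (N, D2) training patterns.
    Returns s of shape (N,), nonzero only for the k nearest training patterns.
    """
    sq_dist = np.sum((Y_train - y_star) ** 2, axis=1)
    neighbours = np.argsort(sq_dist)[:k]
    s = np.zeros(Y_train.shape[0])
    s[neighbours] = np.exp(-sq_dist[neighbours] / t)  # heat kernel, width t assumed
    return s
```

Only the `k` selected entries are nonzero, so sums over all $N$ training instances weighted by $s_i$ reduce to sums over the neighbours.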
Then our posterior regularization strategy can be formulated
as


$$\min_{q(z_*) \in \mathcal{P}} \mathrm{KL}\big(q(z_*) \,\|\, p(z_*)\big) - \mathbb{E}_{q(z_*)}\big[\log p(y_* \mid z_*)\big] + \rho R(q(z_*)), \tag{16}$$


where the parameter $\rho > 0$ controls the expected scale. As a
direct way to impose constraints and incorporate knowledge
in Bayesian models, posterior regularization is more natural
and general than specially designed priors. However, directly
solving Eq.(16) with $R$ is difficult and inefficient. Let


$$h(z_* \mid \rho, s) = \exp\Big\{-\frac{\rho}{2} \sum_{i=1}^{N} s_i \|z_* - z_i\|^2\Big\},$$


then Eq.(16) can be rewritten as


$$\min_{q(z_*) \in \mathcal{P}} \mathrm{KL}\big(q(z_*) \,\|\, p(z_*)\big) - \mathbb{E}_{q(z_*)}\big[\log p(y_* \mid z_*)\big] - \mathbb{E}_{q(z_*)}\big[\log h(z_* \mid \rho, s)\big]. \tag{17}$$


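Solving problem Eq.(17), we can get the posterior
distribution


$$
\begin{aligned}
p(z_* \mid y_*) &= \frac{p(z_*)\, p(y_* \mid z_*)\, h(z_* \mid \rho, s)}{p(y_*)} \\
&\propto \int p(y_* \mid z_*, \tilde{z}_*, B, H, \gamma)\, p(z_*)\, h(z_* \mid \rho, s)\, p(\tilde{z}_*)\,
q^*(B)\, q^*(H)\, q^*(\gamma)\, d\tilde{z}_*\, dB\, dH\, d\gamma.
\end{aligned}
\tag{18}
$$


Because the multiple integral over the random variables $\tilde{z}_*$,
$B$, $H$ and $\gamma$ is intractable, we replace the random variables
$B$, $H$ and $\gamma$ with the means of the estimated optimal
distributions $q^*(B)$, $q^*(H)$ and $q^*(\gamma)$, respectively, to
eliminate the integral over $B$, $H$ and $\gamma$. Then $p(z_* \mid y_*)$
becomes


$$p(z_* \mid y_*) \propto \int p(y_* \mid z_*, \tilde{z}_*)\, p(z_*)\, h(z_* \mid \rho, s)\, p(\tilde{z}_*)\, d\tilde{z}_*. \tag{19}$$


Now the posterior distribution $p(z_* \mid y_*)$ can be found as:


$$p(z_* \mid y_*) = \mathcal{N}(z_* \mid \mu_{z_*}, \Sigma_{z_*}), \tag{20}$$

where

$$
\begin{aligned}
\Sigma_{z_*} &= \Big[\langle B T B^\top\rangle + \Big(1 + \rho \sum_{i=1}^{N} s_i\Big) I\Big]^{-1} \\
\mu_{z_*} &= \Sigma_{z_*} \Big[\langle B\rangle T y_* + \rho \sum_{i=1}^{N} s_i \langle z_i\rangle\Big] \\
T &= \langle\gamma\rangle I - \langle\gamma\rangle^2 \big\langle H^\top \big(I + \langle\gamma\rangle \langle H H^\top\rangle\big)^{-1} H \big\rangle.
\end{aligned}
$$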
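
The closed-form posterior of Eq.(20) can be sketched in NumPy as follows. The sketch makes a simplifying assumption that is not stated in the paper: expectations of products such as $\langle B T B^\top\rangle$ are replaced by plug-in products of posterior means. The function name and argument layout are hypothetical.

```python
import numpy as np

def latent_posterior(y_star, B, H, gamma, s, Z_train_mean, rho):
    """
    B: (K, D2) and H: (K_aux, D2) posterior means; gamma: mean noise precision;
    s: (N,) affinities; Z_train_mean: (N, K) posterior means <z_i>.
    Returns the mean (K,) and covariance (K, K) of the Gaussian over z_*.
    """
    k, d2 = B.shape
    inner = np.eye(H.shape[0]) + gamma * (H @ H.T)
    # T is the precision of y_* given z_* after integrating out the auxiliary latent
    T = gamma * np.eye(d2) - gamma**2 * H.T @ np.linalg.solve(inner, H)
    precision = B @ T @ B.T + (1.0 + rho * s.sum()) * np.eye(k)
    cov = np.linalg.inv(precision)
    mean = cov @ (B @ T @ y_star + rho * (s @ Z_train_mean))
    return mean, cov
```

With `rho = 0` this reduces to ordinary Gaussian conditioning on $y_*$; as `rho` grows, the mean is pulled toward the affinity-weighted average of the neighbours' latent means.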
However, with the likelihood of the visual image
$p_\theta(x_{pred} \mid z_*)$ formulated by a DNN, the integral over the
latent variables $z_*$ (Eq.(13)) cannot be computed analytically.
As in the training phase, we can approximate this integral by
Monte-Carlo sampling. Finally, the reconstructed visual image
is calculated by taking the mean of all $L$ predictions, i.e.,
$x_{pred} = \frac{1}{L} \sum_{l=1}^{L} x_{pred}^{(l)}$, where $x_{pred}^{(l)}$ is the output
of the generative network, i.e., $x_{pred}^{(l)} = g_\theta(z_*^{(l)})$.
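
The Monte-Carlo reconstruction step can be sketched as below. This is an illustration only: `decoder` stands for any callable that plays the role of the generative network and returns its mean output, and the name `reconstruct` is hypothetical.

```python
import numpy as np

def reconstruct(mean, cov, decoder, n_samples, rng):
    """Average decoder outputs over samples z_*^(l) ~ N(mean, cov)."""
    z_samples = rng.multivariate_normal(mean, cov, size=n_samples)
    preds = np.stack([decoder(z) for z in z_samples])
    return preds.mean(axis=0)
```

A typical call would pass the mean and covariance of Eq.(20) together with a trained generative network wrapped as `decoder`.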


5 Experiments 


In this section, we present extensive experimental results 
on fMRI recording datasets to demonstrate the effectiveness 
of the proposed framework for perceived image reconstruc- 
tion from human brain activity. Specifically, we compare our 
DGMM with the following algorithms, which use either a 
shallow or a deep architecture: 


• Fixed Bases (Miyawaki et al.): a specially designed
method that reconstructs visual images by combining local
image bases of multiple scales (1 x 1, 1 x 2, 2 x 1, and
2 x 2 pixels covering an entire image) [Miyawaki et al.,
2008]. The shapes of these predefined image bases are
fixed, so they may not be optimal for image
reconstruction.


• Bayesian CCA (BCCA): a probabilistic extension of the
CCA model that relates the fMRI activity space to the
visual image space via a set of latent variables [Fujiwara
et al., 2013]. BCCA assumes a linear observation model
for visual images and a spherical covariance for the
Gaussian distribution of fMRI voxels.


• Deep Canonically Correlated Autoencoders
(DCCAE): a recent deep multiview representation
learning model that consists of two autoencoders and
optimizes the combination of canonical correlation
between the learned bottleneck representations and the
reconstruction errors of the autoencoders [Wang et al.,
2015]. DCCAE does not consider the cross-reconstruction
errors between the two views.


• Deconvolutional Neural Network (De-CNN): a
recent neural decoding method based on multivariate
linear regression and a deconvolutional neural network
[Haiguang Wen and Liu, 2016; Zeiler et al., 2011]. It is
a two-stage cascade model, i.e., it first predicts feature
maps by multivariate linear regression, then reconstructs
images by feeding the estimated feature maps into a
pretrained deconvolutional neural network.


5.1 Experimental testbed and setup 


Data description. We conducted experiments on three public
fMRI datasets obtained from Miyawaki et al. [Miyawaki
et al., 2008] and van Gerven [Van Gerven et al., 2010b;
Schoenmakers et al., 2013]. Dataset 1, consisting of
contrast-defined 10 x 10 patches, contains two independent
sessions [Miyawaki et al., 2008]. One is a ‘random image
session’, in which spatially random patterns were sequentially
presented. The other is a ‘figure image session’, in which
alphabetical letters and simple geometric shapes were
sequentially presented. We used fMRI data from primary visual
area V1 of subject 1 (S1) for the analysis. Note that all
compared algorithms were trained on the data from the ‘random
image session’ and evaluated on the data from the ‘figure image
session’. Dataset 2 contains a hundred handwritten gray-scale
digits (equal number of 6s and 9s) at a 28 x 28 pixel
resolution taken from the training set of the MNIST database
and the fMRI data from V1, V2 and V3 [Van Gerven et al.,
2010b]. Dataset 3 contains 360 gray-scale handwritten
characters (equal number of Bs, Rs, As, Is, Ns, and Ss) at a
56 x 56 pixel resolution taken from [Van der Maaten, 2009] and
the fMRI data of V1, V2 taken from three subjects
[Schoenmakers et al., 2013]. The visual images were
downsampled from 56 x 56 pixels to 28 x 28 pixels in our
experiments. The details of the 3 data sets used in our
experiments are summarized in Table 1. See [Miyawaki et al.,
2008; Van Gerven et al., 2010b; Schoenmakers et al., 2013]
for more information, including fMRI data acquisition and
preprocessing.


Voxel selection. Voxel selection is an important compo- 
nent to fMRI brain decoding because many voxels may not 
respond to the visual stimulus. A common approach is to 
choose those voxels that are maximally correlated with the 
visual images during training. We chose voxels for which the 
model provided better predictability (encoding performance). 


Table 1: The details of the 3 data sets used in our experiments.

Datasets  | #Instances | #Pixels | #Voxels | #ROIs      | #Training
Dataset 1 | 1400       | 100     | 7197    | V1         | 1320
Dataset 2 | 100        | 784     | 3092    | V1, V2, V3 | 90
Dataset 3 | 360        | 784     | 2420    | V1, V2     | 330


This codifies our intuition that the voxels that are better
predicted from the visual images are those to be included in
the decoding model. The goodness-of-fit between model
predictions and measured voxel activities was quantified using
the coefficient of determination ($R^2$), which indicates the
percentage of variance that is explained by the model. In the
experiments, we first computed the $R^2$ of each voxel using
10-fold cross-validation on training data; voxels with positive
$R^2$ were then selected for further analysis.
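
The voxel selection procedure can be sketched as follows. The paper does not specify the encoding model used to predict voxel activities from images, so the ridge regression, its penalty `ridge`, the fold assignment and the function names below are all assumptions for illustration.

```python
import numpy as np

def cv_r2_per_voxel(X, Y, n_folds=10, ridge=1.0, seed=0):
    """
    X: (N, D1) images, Y: (N, D2) voxel responses.
    Returns the cross-validated R^2 of each voxel under a ridge encoding model.
    """
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), n_folds)
    Y_hat = np.zeros_like(Y, dtype=float)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        X_tr, Y_tr = X[train_idx], Y[train_idx]
        x_mean, y_mean = X_tr.mean(axis=0), Y_tr.mean(axis=0)
        A = (X_tr - x_mean).T @ (X_tr - x_mean) + ridge * np.eye(X.shape[1])
        W = np.linalg.solve(A, (X_tr - x_mean).T @ (Y_tr - y_mean))
        Y_hat[test_idx] = (X[test_idx] - x_mean) @ W + y_mean
    ss_res = np.sum((Y - Y_hat) ** 2, axis=0)
    ss_tot = np.sum((Y - Y.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot

def select_voxels(X, Y, **kwargs):
    """Indices of voxels whose cross-validated R^2 is positive."""
    return np.flatnonzero(cv_r2_per_voxel(X, Y, **kwargs) > 0)
```

Because predictions are always made on held-out folds, a voxel that carries no stimulus information tends to obtain an $R^2$ near or below zero and is discarded.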


Parameter setting. The hyper-parameters of the proposed
DGMM were set to $(\alpha_\tau, \beta_\tau) = (\alpha_\eta, \beta_\eta) = (\alpha_\gamma, \beta_\gamma) = (1, 1)$
for all data sets, while 5-fold cross validation was conducted
on the training sets to choose the regularization parameter
$\rho$ from 2!-®°l. For fair comparison, the model parameters of
the other methods were also tuned carefully. In our
experiments, we considered multi-layer perceptrons (MLPs) as
the type of recognition models. Inspired by the selectivity
of visual areas to feature maps of varying complexity [Güçlü
and van Gerven, 2015; Haiguang Wen and Liu, 2016], we set
the structures of the recognition network for visual images as
‘100-200’, ‘784-256-128-10’ and ‘784-256-128-5’ for the three
data sets, respectively. In particular, we considered two types
of structures for DCCAE. One has an asymmetric shape
(same setup as our model for the image view and a single-layer
setup for the fMRI view, DCCAE-A), which can mimic our
model in structure and function. The other has a symmetric
shape (same setup for both views, DCCAE-S), which
can explore deep nonlinear maps for fMRI data.


5.2 Performance evaluation 


The geometric shapes and alphabet letters, handwritten
digits and handwritten characters reconstructed by the
proposed DGMM and the other algorithms are shown in Fig.2,
Fig.3 and Fig.4, respectively, where the first row shows the
presented images and the rows below show the reconstructed
images obtained from all compared algorithms.

Overall, the images reconstructed by DGMM captured the
essential features of the presented images. In particular, they
showed fine reconstructions for handwritten digits and
characters. Although the reconstructed geometric shapes and
alphabet letters had some noise in the peripheral regions, the
main shapes can be clearly distinguished. Although the
reconstructions of handwritten digits and characters shared
certain characteristics with their corresponding original
images, there are subtle differences in the strokes. We
attribute this phenomenon to the fact that the manifold
regularization imposed on the latent representations may change
the details of reconstructed images. In contrast, images
reconstructed by Miyawaki’s method and BCCA were coarse for all
image types, with noise scattered over the entire reconstructed
image. Also, both DCCAE-S and DCCAE-A produced disappointing
reconstructions which often lacked the shapes of the
presented images, especially for geometric shapes and alphabet
letters. This might be due to the fact that nonlinear maps
easily overfit the voxel activities.

Figure 2: Image reconstructions of geometric shapes and
alphabet letters taken from Dataset 1.

Figure 3: Image reconstructions of 10 distinct handwritten
digits taken from Dataset 2.

Figure 4: Examples of 18 distinct reconstructed handwritten
characters taken from subject 3 of Dataset 3.


To evaluate the reconstruction performance quantitatively,
we used several standard image similarity metrics, including
Pearson’s correlation coefficient (PCC), mean squared error
(MSE) and structural similarity index (SSIM) [Wang et al.,
2004]. Note that MSE is not highly indicative of perceived
similarity, while SSIM can address this shortcoming by
taking texture into account. In addition, we also performed
image classification analysis to quantify the reconstruction
accuracy from another perspective. Specifically, a linear
support vector machine (SVM) and a convolutional neural
network (CNN), both trained on the presented visual images,
were used as classifiers to label the reconstructed images.
The classification accuracies of the SVM (ACC-SVM) and CNN
(ACC-CNN) on the reconstructed images are reported.
Performance comparisons are listed in Table 2. Note that we
also list the time consumed in the training phase for all
compared algorithms in the last column for reference. Several
observations can be drawn as follows.

First, by comparing DGMM against the other algorithms,
we can find that DGMM performs considerably better on all
three data sets. In particular, the SSIM values of DGMM
significantly surpass the baseline algorithms in all cases.

Second, by examining DGMM against BCCA, which has a
linear observation model for visual images, we can find that
DGMM always outperforms BCCA. This encouraging result
shows that DGMM with a DNN model for visual images
is able to extract nonlinear features from visual images.

Third, DGMM shows obviously better performance than
DCCAE-A and DCCAE-S. Besides DCCAE ignoring
cross-reconstruction, this is also because a linear
map between voxel activities and the bottleneck representation
is enough to achieve good performance, while nonlinear
maps easily overfit under the high dimensionality of
limited fMRI data instances.

Fourth, the performance of De-CNN is moderate for all
data sets. We attribute this to the fact that it is a two-stage
method, which cannot obtain the globally optimal model
parameters.

Finally, nearly 100% correct classification is possible for 
each algorithm on Dataset 2. We believe that it is caused 


Table 2: Performance of several image reconstruction methods on the test sets. Results were averaged over 20 random seeds
and all subjects (mean±std). The best performance on each dataset was highlighted.


Datasets  | Algorithms      | PCC       | MSE       | SSIM      | ACC-SVM   | ACC-CNN   | Time(s)
Dataset 1 | Miyawaki et al. | .609±.151 | .162±.025 | .237±.105 | -         | -         | 19.4±1.1
Dataset 1 | BCCA            | .438±.215 | .253±.051 | .181±.066 | -         | -         | 74.9±3.0
Dataset 1 | DCCAE-A         | .455±.113 | .234±.029 | .166±.025 | -         | -         | 211.8±7.5
Dataset 1 | DCCAE-S         | .401±.100 | .240±.027 | .175±.011 | -         | -         | 254.9±9.8
Dataset 1 | De-CNN          | .469±.149 | .263±.067 | .224±.129 | -         | -         | 108.2±2.2
Dataset 1 | DGMM            | .611±.183 | .159±.112 | .268±.106 | -         | -         | 118.4±2.5
Dataset 2 | Miyawaki et al. | .767±.033 | .042±.007 | .466±.030 | 1.00      | 1.00      | 39.9±1.2
Dataset 2 | BCCA            | .411±.157 | .119±.017 | .192±.035 | 1.00      | 1.00      | 20.7±1.0
Dataset 2 | DCCAE-A         | .548±.044 | .074±.010 | .358±.097 | .900      | .967±.047 | 12.7±0.3
Dataset 2 | DCCAE-S         | .511±.057 | .080±.016 | .552±.088 | 1.00      | 1.00      | 19.4±0.8
Dataset 2 | De-CNN          | .799±.062 | .038±.010 | .613±.043 | 1.00      | 1.00      | 35.8±1.2
Dataset 2 | DGMM            | .803±.063 | .037±.014 | .645±.054 | 1.00      | 1.00      | 18.6±1.2
Dataset 3 | Miyawaki et al. | .481±.096 | .067±.026 | .191±.043 | .655±.193 | .655±.113 | 128.1±4.6
Dataset 3 | BCCA            | .348±.138 | .128±.049 | .058±.042 | .633±.034 | .600±.098 | 32.9±1.0
Dataset 3 | DCCAE-A         | .354±.167 | .073±.036 | .186±.234 | .478±.126 | .533±.072 | 38.1±1.1
Dataset 3 | DCCAE-S         | .351±.153 | .086±.031 | .179±.117 | .478±.051 | .478±.155 | 59.5±1.8
Dataset 3 | De-CNN          | .470±.149 | .084±.035 | .322±.118 | .589±.135 | .611±.128 | 96.8±2.0
Dataset 3 | DGMM            | .498±.193 | .058±.031 | .340±.051 | .767±.115 | .778±.083 | 42.4±4.2


by the fact that digits 6 and 9 are easy to distinguish from
each other. On Dataset 3, the remarkably higher classifica- 
tion performance on the images reconstructed by our model 
demonstrates the superiority of the proposed DGMM again. 
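
Two of the similarity metrics reported in Table 2, PCC and MSE, can be computed as in the following sketch. The function names are illustrative, and SSIM is omitted because it requires a windowed implementation that is outside the scope of this fragment.

```python
import numpy as np

def pcc(x, x_hat):
    """Pearson's correlation coefficient between two images, computed over all pixels."""
    return np.corrcoef(x.ravel(), x_hat.ravel())[0, 1]

def mse(x, x_hat):
    """Mean squared error between two images."""
    return np.mean((x - x_hat) ** 2)
```

PCC is invariant to affine rescaling of the reconstruction while MSE is not, which is one reason the two metrics can rank methods differently.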


6 Conclusion and future work


We have proposed a deep generative multiview framework
to tackle the perceived image reconstruction problem. In our
framework, multiple correspondences between visual image
pixels and fMRI voxels can be found via a set of latent
variables. We also derived a predictive distribution that
succeeded in reconstructing visual images from brain activity
patterns. Although we focused on the visual image
reconstruction problem in this paper, our framework can also
deal with brain encoding tasks. Extensive experimental studies
have confirmed the superiority of the proposed framework.

Two challenging and promising directions can be considered
in the future. First, by incorporating recurrent neural
networks (RNNs) [Chung et al., 2015] into our framework, we
can explore the reconstruction of dynamic vision. Second,
by treating each subject’s fMRI measurements as one view,
we can explore multi-subject decoding.